Basic Explanation

Knowing in advance whether a patient is likely to be readmitted to hospital is valuable, because the treatment can then be adjusted to try to avoid the readmission.

The dataset records three possible outcomes:

1. No readmission
2. A readmission in less than 30 days
3. A readmission after more than 30 days
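
As a hedged illustration, these three outcomes appear in the `readmitted` column as the labels `NO`, `<30` and `>30`. The sketch below uses a small made-up Series (not rows from the dataset) to count each outcome and to collapse them into a binary flag with the same convention the notebook applies later (1 = no readmission). The variable names are hypothetical.

```python
import pandas as pd

# Toy values for illustration only; these are not rows from the dataset.
readmitted = pd.Series(['NO', '>30', '<30', 'NO', '>30', 'NO'])

# Count how often each of the three outcomes occurs.
outcome_counts = readmitted.value_counts()

# Collapse to a binary target using the notebook's convention:
# 1 = no readmission, 0 = readmitted (either '<30' or '>30').
binary_target = (readmitted == 'NO').astype(int)

print(outcome_counts)
print(binary_target.tolist())
```

Collapsing `<30` and `>30` into one class discards the timing of the readmission, which is a modelling choice rather than a property of the data.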

About the Data

"The data set represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.

1. It is an inpatient encounter (a hospital admission).
2. It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis.
3. The length of stay was at least 1 day and at most 14 days.
4. Laboratory tests were performed during the encounter.
5. Medications were administered during the encounter.

The data contains such attributes as patient number, race, gender, age, admission type, time in hospital, medical specialty of admitting physician, number of lab test performed, HbA1c test result, diagnosis, number of medication, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc."

Source

The data are submitted on behalf of the Center for Clinical and Translational Research, Virginia Commonwealth University, a recipient of NIH CTSA grant UL1 TR00058 and a recipient of the CERNER data. John Clore (jclore '@' vcu.edu), Krzysztof J. Cios (kcios '@' vcu.edu), Jon DeShazo (jpdeshazo '@' vcu.edu), and Beata Strack (strackb '@' vcu.edu). This data is a de-identified abstract of the Health Facts database (Cerner Corporation, Kansas City, MO).

Original source of the data set:

https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

Motivation

The motivation behind this solution is as follows:

1. To create a model that predicts readmission, so that patients can receive better treatment.
2. Diabetes is a worldwide problem, and we should try to understand its causes and contributing factors.
3. This is an attempt to implement both the modelling and the explanation of the models.

Note: This is not a perfect solution, as it was created with limited time and effort. Please do not use it for medical purposes; it can, however, be used for educational purposes.

Data Exploration

In [1]:
import pandas as pd
import numpy as np

# Warnings: a single blanket filter already covers every warning category
import warnings
warnings.filterwarnings(action='ignore')

# modeling
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score,accuracy_score,roc_auc_score

# model explainers
import lime
from lime.lime_tabular import LimeTabularExplainer
import eli5
from eli5.sklearn import PermutationImportance
import shap
from shap import TreeExplainer,KernelExplainer,LinearExplainer
shap.initjs()

import time
import datetime
import platform
start = time.time()

warnings.simplefilter('ignore')
Using TensorFlow backend.
In [2]:
# Create table for missing data in the given dataframe.
def draw_missing_data_table(df):
    total = df.isnull().sum().sort_values(ascending=False)
    percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return missing_data
In [3]:
dataset = pd.read_csv('data/diabetic_data.csv')
In [4]:
# Looking at the dataset
dataset.head().T
Out[4]:
0 1 2 3 4
encounter_id 2278392 149190 64410 500364 16680
patient_nbr 8222157 55629189 86047875 82442376 42519267
race Caucasian Caucasian AfricanAmerican Caucasian Caucasian
gender Female Female Female Male Male
age [0-10) [10-20) [20-30) [30-40) [40-50)
weight ? ? ? ? ?
admission_type_id 6 1 1 1 1
discharge_disposition_id 25 1 1 1 1
admission_source_id 1 7 7 7 7
time_in_hospital 1 3 2 2 1
payer_code ? ? ? ? ?
medical_specialty Pediatrics-Endocrinology ? ? ? ?
num_lab_procedures 41 59 11 44 51
num_procedures 0 0 5 1 0
num_medications 1 18 13 16 8
number_outpatient 0 0 2 0 0
number_emergency 0 0 0 0 0
number_inpatient 0 0 1 0 0
diag_1 250.83 276 648 8 197
diag_2 ? 250.01 250 250.43 157
diag_3 ? 255 V27 403 250
number_diagnoses 1 9 6 7 5
max_glu_serum None None None None None
A1Cresult None None None None None
metformin No No No No No
repaglinide No No No No No
nateglinide No No No No No
chlorpropamide No No No No No
glimepiride No No No No No
acetohexamide No No No No No
glipizide No No Steady No Steady
glyburide No No No No No
tolbutamide No No No No No
pioglitazone No No No No No
rosiglitazone No No No No No
acarbose No No No No No
miglitol No No No No No
troglitazone No No No No No
tolazamide No No No No No
examide No No No No No
citoglipton No No No No No
insulin No Up No Up Steady
glyburide-metformin No No No No No
glipizide-metformin No No No No No
glimepiride-pioglitazone No No No No No
metformin-rosiglitazone No No No No No
metformin-pioglitazone No No No No No
change No Ch No Ch Ch
diabetesMed No Yes Yes Yes Yes
readmitted NO >30 NO NO NO
In [5]:
dataset.describe(include = 'all').T
Out[5]:
count unique top freq mean std min 25% 50% 75% max
encounter_id 101766 NaN NaN NaN 1.65202e+08 1.0264e+08 12522 8.49612e+07 1.52389e+08 2.30271e+08 4.43867e+08
patient_nbr 101766 NaN NaN NaN 5.43304e+07 3.86964e+07 135 2.34132e+07 4.55051e+07 8.75459e+07 1.89503e+08
race 101766 6 Caucasian 76099 NaN NaN NaN NaN NaN NaN NaN
gender 101766 3 Female 54708 NaN NaN NaN NaN NaN NaN NaN
age 101766 10 [70-80) 26068 NaN NaN NaN NaN NaN NaN NaN
weight 101766 10 ? 98569 NaN NaN NaN NaN NaN NaN NaN
admission_type_id 101766 NaN NaN NaN 2.02401 1.4454 1 1 1 3 8
discharge_disposition_id 101766 NaN NaN NaN 3.71564 5.28017 1 1 1 4 28
admission_source_id 101766 NaN NaN NaN 5.75444 4.06408 1 1 7 7 25
time_in_hospital 101766 NaN NaN NaN 4.39599 2.98511 1 2 4 6 14
payer_code 101766 18 ? 40256 NaN NaN NaN NaN NaN NaN NaN
medical_specialty 101766 73 ? 49949 NaN NaN NaN NaN NaN NaN NaN
num_lab_procedures 101766 NaN NaN NaN 43.0956 19.6744 1 31 44 57 132
num_procedures 101766 NaN NaN NaN 1.33973 1.70581 0 0 1 2 6
num_medications 101766 NaN NaN NaN 16.0218 8.12757 1 10 15 20 81
number_outpatient 101766 NaN NaN NaN 0.369357 1.26727 0 0 0 0 42
number_emergency 101766 NaN NaN NaN 0.197836 0.930472 0 0 0 0 76
number_inpatient 101766 NaN NaN NaN 0.635566 1.26286 0 0 0 1 21
diag_1 101766 717 428 6862 NaN NaN NaN NaN NaN NaN NaN
diag_2 101766 749 276 6752 NaN NaN NaN NaN NaN NaN NaN
diag_3 101766 790 250 11555 NaN NaN NaN NaN NaN NaN NaN
number_diagnoses 101766 NaN NaN NaN 7.42261 1.9336 1 6 8 9 16
max_glu_serum 101766 4 None 96420 NaN NaN NaN NaN NaN NaN NaN
A1Cresult 101766 4 None 84748 NaN NaN NaN NaN NaN NaN NaN
metformin 101766 4 No 81778 NaN NaN NaN NaN NaN NaN NaN
repaglinide 101766 4 No 100227 NaN NaN NaN NaN NaN NaN NaN
nateglinide 101766 4 No 101063 NaN NaN NaN NaN NaN NaN NaN
chlorpropamide 101766 4 No 101680 NaN NaN NaN NaN NaN NaN NaN
glimepiride 101766 4 No 96575 NaN NaN NaN NaN NaN NaN NaN
acetohexamide 101766 2 No 101765 NaN NaN NaN NaN NaN NaN NaN
glipizide 101766 4 No 89080 NaN NaN NaN NaN NaN NaN NaN
glyburide 101766 4 No 91116 NaN NaN NaN NaN NaN NaN NaN
tolbutamide 101766 2 No 101743 NaN NaN NaN NaN NaN NaN NaN
pioglitazone 101766 4 No 94438 NaN NaN NaN NaN NaN NaN NaN
rosiglitazone 101766 4 No 95401 NaN NaN NaN NaN NaN NaN NaN
acarbose 101766 4 No 101458 NaN NaN NaN NaN NaN NaN NaN
miglitol 101766 4 No 101728 NaN NaN NaN NaN NaN NaN NaN
troglitazone 101766 2 No 101763 NaN NaN NaN NaN NaN NaN NaN
tolazamide 101766 3 No 101727 NaN NaN NaN NaN NaN NaN NaN
examide 101766 1 No 101766 NaN NaN NaN NaN NaN NaN NaN
citoglipton 101766 1 No 101766 NaN NaN NaN NaN NaN NaN NaN
insulin 101766 4 No 47383 NaN NaN NaN NaN NaN NaN NaN
glyburide-metformin 101766 4 No 101060 NaN NaN NaN NaN NaN NaN NaN
glipizide-metformin 101766 2 No 101753 NaN NaN NaN NaN NaN NaN NaN
glimepiride-pioglitazone 101766 2 No 101765 NaN NaN NaN NaN NaN NaN NaN
metformin-rosiglitazone 101766 2 No 101764 NaN NaN NaN NaN NaN NaN NaN
metformin-pioglitazone 101766 2 No 101765 NaN NaN NaN NaN NaN NaN NaN
change 101766 2 No 54755 NaN NaN NaN NaN NaN NaN NaN
diabetesMed 101766 2 Yes 78363 NaN NaN NaN NaN NaN NaN NaN
readmitted 101766 3 NO 54864 NaN NaN NaN NaN NaN NaN NaN
In [6]:
dataset.replace('?',np.nan,inplace=True)
draw_missing_data_table(dataset)
Out[6]:
Total Percent
weight 98569 0.968585
medical_specialty 49949 0.490822
payer_code 40256 0.395574
race 2273 0.022336
diag_3 1423 0.013983
diag_2 358 0.003518
diag_1 21 0.000206
num_procedures 0 0.000000
max_glu_serum 0 0.000000
number_diagnoses 0 0.000000
number_inpatient 0 0.000000
number_emergency 0 0.000000
number_outpatient 0 0.000000
num_medications 0 0.000000
readmitted 0 0.000000
num_lab_procedures 0 0.000000
diabetesMed 0 0.000000
time_in_hospital 0 0.000000
admission_source_id 0 0.000000
discharge_disposition_id 0 0.000000
admission_type_id 0 0.000000
age 0 0.000000
gender 0 0.000000
patient_nbr 0 0.000000
A1Cresult 0 0.000000
metformin 0 0.000000
repaglinide 0 0.000000
nateglinide 0 0.000000
change 0 0.000000
metformin-pioglitazone 0 0.000000
metformin-rosiglitazone 0 0.000000
glimepiride-pioglitazone 0 0.000000
glipizide-metformin 0 0.000000
glyburide-metformin 0 0.000000
insulin 0 0.000000
citoglipton 0 0.000000
examide 0 0.000000
tolazamide 0 0.000000
troglitazone 0 0.000000
miglitol 0 0.000000
acarbose 0 0.000000
rosiglitazone 0 0.000000
pioglitazone 0 0.000000
tolbutamide 0 0.000000
glyburide 0 0.000000
glipizide 0 0.000000
acetohexamide 0 0.000000
glimepiride 0 0.000000
chlorpropamide 0 0.000000
encounter_id 0 0.000000

Feature Creation

The following columns are dropped:

1. Columns with a high share of missing values
2. ID-related columns

The following transformations are applied:

1. Fill missing Race values with Unknown
2. Drop rows that still contain NULL values

Finally, feature encoding is performed.
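
The cleaning steps listed above can be sketched on a toy frame. This is a minimal illustration, not the notebook's own cell: the data are invented, and the 0.3 missing-fraction cut-off is an assumption chosen for the example (the notebook drops its columns by name).

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; '?' marks a missing value as in the raw CSV.
toy = pd.DataFrame({
    'encounter_id': [1, 2, 3, 4],
    'weight': ['?', '?', '?', '[75-100)'],
    'race': ['Caucasian', '?', 'Asian', 'Other'],
    'diag_1': ['428', '250', '?', '276'],
    'time_in_hospital': [3, 1, 5, 2],
})

# Treat '?' as missing.
toy = toy.replace('?', np.nan)

# 1. Drop columns whose missing fraction exceeds a threshold
#    (0.3 is an assumed cut-off, not a value stated in the notebook).
missing_fraction = toy.isnull().mean()
high_missing = list(missing_fraction[missing_fraction > 0.3].index)
toy = toy.drop(columns=high_missing)

# 2. Drop the identifier column.
toy = toy.drop(columns=['encounter_id'])

# 3. Fill missing race with an explicit 'Unknown' category.
toy['race'] = toy['race'].fillna('Unknown')

# 4. Drop the remaining rows that still contain a missing value.
toy = toy.dropna()

print(high_missing)
print(toy)
```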

In [7]:
dataset.drop(['weight','medical_specialty','payer_code'],axis=1,inplace=True)
# dropping columns related to IDs
dataset.drop(['encounter_id','patient_nbr','admission_type_id',
         'discharge_disposition_id','admission_source_id'],axis=1,inplace=True)
In [8]:
dataset['gender'].unique()
Out[8]:
array(['Female', 'Male', 'Unknown/Invalid'], dtype=object)
In [9]:
dataset['race'].unique()
dataset['race'] = dataset['race'].fillna('Unknown')
In [10]:
dataset['race'].unique()
Out[10]:
array(['Caucasian', 'AfricanAmerican', 'Unknown', 'Other', 'Asian',
       'Hispanic'], dtype=object)
In [11]:
dataset.dropna(inplace=True)
In [12]:
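# Binary target: 1 = not readmitted ('NO'), 0 = readmitted ('<30' or '>30').
# This matches class_names=['Readmitted','Not Readmitted'] used for LIME below.
dataset['readmitted'] = dataset['readmitted'].apply(lambda x: 1 if x == 'NO' else 0)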
In [13]:
from sklearn import preprocessing

def feature_encoding(df, label):
    # Record the original columns up front, so the continuous features are
    # derived from `df` itself rather than from the global `dataset`.
    original_columns = list(df.columns.values)
    cat_cols = list(df.select_dtypes('object').columns)

    # One-hot encode every categorical column.
    for col in cat_cols:
        features_encoded = pd.get_dummies(df[col], prefix=col + '_is')
        df = df.join(features_encoded)
        print('Encoded column {}'.format(col))
        df = df.drop(columns=[col])

    # Every non-categorical column except the label is treated as continuous.
    continuous_features = [x for x in original_columns if x not in cat_cols]
    continuous_features.remove(label)

    # Standardize each continuous feature to zero mean and unit variance.
    for col in continuous_features:
        transf = df[col].values.reshape(-1,1)
        scaler = preprocessing.StandardScaler().fit(transf)
        df[col] = scaler.transform(transf)

    return df
In [14]:
feature_dataset = feature_encoding(dataset, 'readmitted')
Encoded column race
Encoded column gender
Encoded column age
Encoded column diag_1
Encoded column diag_2
Encoded column diag_3
Encoded column max_glu_serum
Encoded column A1Cresult
Encoded column metformin
Encoded column repaglinide
Encoded column nateglinide
Encoded column chlorpropamide
Encoded column glimepiride
Encoded column acetohexamide
Encoded column glipizide
Encoded column glyburide
Encoded column tolbutamide
Encoded column pioglitazone
Encoded column rosiglitazone
Encoded column acarbose
Encoded column miglitol
Encoded column troglitazone
Encoded column tolazamide
Encoded column examide
Encoded column citoglipton
Encoded column insulin
Encoded column glyburide-metformin
Encoded column glipizide-metformin
Encoded column glimepiride-pioglitazone
Encoded column metformin-rosiglitazone
Encoded column metformin-pioglitazone
Encoded column change
Encoded column diabetesMed
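
For comparison, the same two steps (one-hot encoding of categorical columns and standardization of continuous ones) can be expressed with scikit-learn's `ColumnTransformer`. This is an alternative sketch on invented data, not the method used in the notebook; the column lists and the toy frame are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame for illustration only.
toy = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female', 'Male'],
    'insulin': ['No', 'Up', 'Steady', 'No'],
    'time_in_hospital': [1, 3, 2, 6],
    'num_medications': [1, 18, 13, 16],
})

categorical_cols = ['gender', 'insulin']
continuous_cols = ['time_in_hospital', 'num_medications']

# One-hot encode the categorical columns and standardize the continuous ones.
# sparse_threshold=0 forces a dense array so the result is easy to inspect.
encoder = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('scale', StandardScaler(), continuous_cols),
    ],
    sparse_threshold=0,
)

encoded = encoder.fit_transform(toy)
print(encoded.shape)
```

A fitted transformer of this kind can be reused on new rows with `encoder.transform`, which keeps the encoding learned on the training data separate from the data it is applied to.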
In [15]:
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel


training_dataset = feature_dataset.copy()

y_train=feature_dataset[['readmitted']]
X_train=feature_dataset.drop(['readmitted'],axis=1)

### Apply Feature Selection

feature_sel_model = SelectFromModel(Lasso(alpha=0.001, random_state=0)) 
# remember to set the seed, the random state in this function
feature_sel_model.fit(X_train, y_train)

feature_sel_model.get_support()

feature_list = X_train.columns[(feature_sel_model.get_support())]

# let's print some stats
print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(feature_list)))

print(feature_list)
total features: 2357
selected features: 44
Index(['time_in_hospital', 'num_lab_procedures', 'num_procedures',
       'number_outpatient', 'number_emergency', 'number_inpatient',
       'number_diagnoses', 'race_is_AfricanAmerican', 'race_is_Caucasian',
       'race_is_Unknown', 'gender_is_Female', 'age_is_[60-70)',
       'age_is_[70-80)', 'age_is_[80-90)', 'age_is_[90-100)',
       'diag_1_is_250.6', 'diag_1_is_428', 'diag_1_is_491', 'diag_1_is_493',
       'diag_2_is_250', 'diag_2_is_401', 'diag_2_is_403', 'diag_2_is_411',
       'diag_2_is_428', 'diag_2_is_496', 'diag_2_is_518', 'diag_2_is_585',
       'diag_2_is_707', 'diag_3_is_250', 'diag_3_is_250.6', 'diag_3_is_276',
       'diag_3_is_401', 'diag_3_is_403', 'diag_3_is_428', 'A1Cresult_is_None',
       'A1Cresult_is_Norm', 'metformin_is_No', 'glipizide_is_No',
       'pioglitazone_is_No', 'rosiglitazone_is_Steady', 'insulin_is_Down',
       'insulin_is_Steady', 'change_is_Ch', 'diabetesMed_is_No'],
      dtype='object')
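
The selection step can be illustrated on synthetic data. The sketch below is an assumption-laden example rather than a reproduction of the cell above: the data are generated, `alpha=0.05` is chosen for the toy problem, and the selector is fitted on the training split only so that the held-out rows do not influence which features are kept.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: only the first two of ten features
# influence the binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=2000) > 0).astype(int)

# Split first so the selector never sees the held-out rows.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Lasso shrinks weak coefficients to exactly zero; SelectFromModel keeps
# the features whose coefficients stay non-zero.
selector = SelectFromModel(Lasso(alpha=0.05, random_state=0))
selector.fit(X_tr, y_tr)

support = selector.get_support()
X_tr_selected = selector.transform(X_tr)
X_te_selected = selector.transform(X_te)
print(support)
```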
In [16]:
import seaborn as sns
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

color = sns.color_palette()
sns.set_style('darkgrid')


corrmat = training_dataset[list(feature_list) + ['readmitted']].corr()
plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, vmax=0.9, square=True)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a25d5c780>
In [17]:
list(feature_list) + ['readmitted']
Out[17]:
['time_in_hospital', 'num_lab_procedures', 'num_procedures', 'number_outpatient', 'number_emergency', 'number_inpatient', 'number_diagnoses', 'race_is_AfricanAmerican', 'race_is_Caucasian', 'race_is_Unknown', 'gender_is_Female', 'age_is_[60-70)', 'age_is_[70-80)', 'age_is_[80-90)', 'age_is_[90-100)', 'diag_1_is_250.6', 'diag_1_is_428', 'diag_1_is_491', 'diag_1_is_493', 'diag_2_is_250', 'diag_2_is_401', 'diag_2_is_403', 'diag_2_is_411', 'diag_2_is_428', 'diag_2_is_496', 'diag_2_is_518', 'diag_2_is_585', 'diag_2_is_707', 'diag_3_is_250', 'diag_3_is_250.6', 'diag_3_is_276', 'diag_3_is_401', 'diag_3_is_403', 'diag_3_is_428', 'A1Cresult_is_None', 'A1Cresult_is_Norm', 'metformin_is_No', 'glipizide_is_No', 'pioglitazone_is_No', 'rosiglitazone_is_Steady', 'insulin_is_Down', 'insulin_is_Steady', 'change_is_Ch', 'diabetesMed_is_No', 'readmitted']

Training / Test Split

(75%/25%)

In [18]:
X = feature_dataset.drop('readmitted',axis=1)[list(feature_list)]
y = feature_dataset['readmitted']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
X_train.shape,X_test.shape
Out[18]:
((75183, 44), (25061, 44))
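
A possible variation on the 75%/25% split is to pass `stratify=y`, which keeps the class ratio the same in both parts. The sketch below shows this on synthetic labels; it is an optional refinement and not something the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels for illustration: 80% ones, 20% zeros.
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 800 + [0] * 200)

# Same 75/25 split, with stratify=y to keep the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```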
In [19]:
list(X_train.columns.values)
Out[19]:
['time_in_hospital', 'num_lab_procedures', 'num_procedures', 'number_outpatient', 'number_emergency', 'number_inpatient', 'number_diagnoses', 'race_is_AfricanAmerican', 'race_is_Caucasian', 'race_is_Unknown', 'gender_is_Female', 'age_is_[60-70)', 'age_is_[70-80)', 'age_is_[80-90)', 'age_is_[90-100)', 'diag_1_is_250.6', 'diag_1_is_428', 'diag_1_is_491', 'diag_1_is_493', 'diag_2_is_250', 'diag_2_is_401', 'diag_2_is_403', 'diag_2_is_411', 'diag_2_is_428', 'diag_2_is_496', 'diag_2_is_518', 'diag_2_is_585', 'diag_2_is_707', 'diag_3_is_250', 'diag_3_is_250.6', 'diag_3_is_276', 'diag_3_is_401', 'diag_3_is_403', 'diag_3_is_428', 'A1Cresult_is_None', 'A1Cresult_is_Norm', 'metformin_is_No', 'glipizide_is_No', 'pioglitazone_is_No', 'rosiglitazone_is_Steady', 'insulin_is_Down', 'insulin_is_Steady', 'change_is_Ch', 'diabetesMed_is_No']

Modelling

In [20]:
%%time
ML_models = {}
model_index = ['LR','RF','DT','NN']
model_sklearn = [LogisticRegression(solver='liblinear',random_state=0),
                 RandomForestClassifier(n_estimators=100,random_state=0),
                 DecisionTreeClassifier(),
                 MLPClassifier([100]*5,early_stopping=True,learning_rate='adaptive',random_state=0)]
model_summary = []
for name,model in zip(model_index,model_sklearn):
    ML_models[name] = model.fit(X_train,y_train)
    preds = model.predict(X_test)
    model_summary.append([name,f1_score(y_test,preds,average='weighted'),accuracy_score(y_test,preds),
                          roc_auc_score(y_test,model.predict_proba(X_test)[:,1])])
print(ML_models)
{'LR': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False), 'RF': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=0, verbose=0, warm_start=False), 'DT': DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'), 'NN': MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=True, epsilon=1e-08,
       hidden_layer_sizes=[100, 100, 100, 100, 100],
       learning_rate='adaptive', learning_rate_init=0.001, max_iter=200,
       momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
       power_t=0.5, random_state=0, shuffle=True, solver='adam',
       tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)}
CPU times: user 45.9 s, sys: 1.24 s, total: 47.2 s
Wall time: 31.8 s
In [21]:
model_summary = pd.DataFrame(model_summary,columns=['Name','F1_score','Accuracy','AUC_ROC'])
model_summary = model_summary.reset_index()
model_summary
Out[21]:
   index Name  F1_score  Accuracy   AUC_ROC
0      0   LR  0.610241  0.622561  0.663628
1      1   RF  0.601549  0.603049  0.639168
2      2   DT  0.544905  0.544871  0.542586
3      3   NN  0.624362  0.626312  0.668215
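
To make the three columns of the summary concrete, the sketch below computes the same metrics on four invented predictions. The 0.5 threshold used to turn probabilities into hard predictions is an assumption for this example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy labels and predicted probabilities of class 1 (illustrative values).
y_true = np.array([0, 0, 1, 1])
proba_class1 = np.array([0.1, 0.4, 0.35, 0.8])

# Hard predictions from an assumed 0.5 threshold.
preds = (proba_class1 >= 0.5).astype(int)

accuracy = accuracy_score(y_true, preds)
weighted_f1 = f1_score(y_true, preds, average='weighted')
# ROC AUC is computed from the probabilities, not the hard predictions.
auc = roc_auc_score(y_true, proba_class1)

print(accuracy, weighted_f1, auc)
```

Accuracy and F1 depend on the chosen threshold, whereas ROC AUC only depends on how the probabilities rank the rows, which is why the loop above passes `predict_proba(X_test)[:,1]` to `roc_auc_score`.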
In [22]:
g=sns.regplot(data=model_summary, x="index", y="AUC_ROC", fit_reg=False,
               color="red", scatter_kws={'s':500})

for i in range(0,model_summary.shape[0]):
     g.text(model_summary.loc[i,'index'], model_summary.loc[i,'AUC_ROC']+0.02, model_summary.loc[i,'Name'], 
            horizontalalignment='center',verticalalignment='top', size='large', color='black')

Interpretations Using

1. LIME
2. ELI5
3. SHAP

In [43]:
test_row = pd.DataFrame(X_test.loc[200,:]).T
test_row
Out[43]:
time_in_hospital num_lab_procedures num_procedures number_outpatient number_emergency number_inpatient number_diagnoses race_is_AfricanAmerican race_is_Caucasian race_is_Unknown ... A1Cresult_is_None A1Cresult_is_Norm metformin_is_No glipizide_is_No pioglitazone_is_No rosiglitazone_is_Steady insulin_is_Down insulin_is_Steady change_is_Ch diabetesMed_is_No
200 -1.143423 0.041962 -0.789217 -0.292418 -0.213183 -0.506404 0.817053 0.0 0.0 0.0 ... 1.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0

1 rows × 44 columns

LIME

In [24]:
# Initialize a LIME explainer on the training data
explainer = LimeTabularExplainer(X_train.values,
                                 mode='classification',
                                 feature_names=X_train.columns,
                                 class_names=['Readmitted','Not Readmitted'])
In [25]:
exp = explainer.explain_instance(test_row.values[0],
                                 ML_models['LR'].predict_proba,
                                 num_features=X_train.shape[1])
exp.show_in_notebook(show_table=True)
In [26]:
exp = explainer.explain_instance(test_row.values[0],
                                 ML_models['RF'].predict_proba,
                                 num_features=X_train.shape[1])
exp.show_in_notebook(show_table=True)
In [27]:
exp = explainer.explain_instance(test_row.values[0],
                                 ML_models['DT'].predict_proba,
                                 num_features=X_train.shape[1])
exp.show_in_notebook(show_table=True)
In [28]:
exp = explainer.explain_instance(test_row.values[0],
                                 ML_models['NN'].predict_proba,
                                 num_features=X_train.shape[1])
exp.show_in_notebook(show_table=True)
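
The idea behind these explanations can be sketched without the `lime` package: perturb the instance, query the black-box model, weight the perturbed samples by proximity, and fit a weighted linear model whose coefficients serve as the local explanation. The code below is a simplified illustration of that idea on synthetic data. It is not the library's implementation, and the function name `local_surrogate`, the Gaussian perturbation and the kernel width are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

# Synthetic data for illustration: the label depends on features 0 and 1 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (2.0 * X[:, 0] - 1.0 * X[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X, y)

def local_surrogate(predict_proba, instance, n_samples=2000, kernel_width=0.75, seed=0):
    """Fit a weighted linear model to the black box around one instance."""
    local_rng = np.random.default_rng(seed)
    # 1. Perturb the instance with Gaussian noise.
    samples = instance + local_rng.normal(size=(n_samples, instance.shape[0]))
    # 2. Query the black-box model for the probability of class 1.
    targets = predict_proba(samples)[:, 1]
    # 3. Weight each sample by its proximity to the instance.
    distances = np.linalg.norm(samples - instance, axis=1)
    weights = np.exp(-(distances ** 2) / (kernel_width ** 2))
    # 4. The surrogate's coefficients are the local explanation.
    surrogate = Ridge(alpha=1.0).fit(samples, targets, sample_weight=weights)
    return surrogate.coef_

# Explain a hypothetical point on the decision boundary.
instance = np.zeros(4)
local_weights = local_surrogate(model.predict_proba, instance)
print(local_weights)
```

Because only `predict_proba` is passed in, the same function works for any of the four models, which mirrors how the cells above reuse one explainer across `LR`, `RF`, `DT` and `NN`.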

ELI5

In [29]:
eli5.show_weights(ML_models['LR'], feature_names = list(X_test.columns),top=None)
Out[29]:

y=1 top features

Weight? Feature
+0.397 diag_2_is_518
+0.310 diabetesMed_is_No
+0.295 <BIAS>
+0.270 race_is_Unknown
+0.253 age_is_[90-100)
+0.172 diag_2_is_401
+0.136 A1Cresult_is_Norm
+0.123 diag_3_is_276
+0.122 insulin_is_Steady
+0.121 diag_2_is_250
+0.088 glipizide_is_No
+0.078 diag_3_is_250
+0.074 pioglitazone_is_No
+0.073 num_procedures
+0.070 diag_3_is_401
+0.024 change_is_Ch
-0.026 time_in_hospital
-0.030 A1Cresult_is_None
-0.037 num_lab_procedures
-0.056 age_is_[80-90)
-0.057 gender_is_Female
-0.065 age_is_[60-70)
-0.070 insulin_is_Down
-0.075 diag_3_is_428
-0.098 metformin_is_No
-0.102 number_outpatient
-0.103 age_is_[70-80)
-0.106 number_diagnoses
-0.113 rosiglitazone_is_Steady
-0.152 race_is_AfricanAmerican
-0.187 diag_2_is_496
-0.197 diag_2_is_428
-0.197 race_is_Caucasian
-0.210 diag_2_is_585
-0.212 diag_2_is_411
-0.220 diag_2_is_707
-0.220 number_emergency
-0.248 diag_3_is_403
-0.421 diag_1_is_428
-0.421 number_inpatient
-0.441 diag_3_is_250.6
-0.441 diag_2_is_403
-0.487 diag_1_is_250.6
-0.487 diag_1_is_491
-0.550 diag_1_is_493
In [30]:
eli5.show_prediction(ML_models['LR'], test_row.values[0],feature_names=list(X_test.columns),top=None)
Out[30]:

y=1 (probability 0.695, score 0.824) top features

Contribution? Feature
+0.310 diabetesMed_is_No
+0.295 <BIAS>
+0.213 number_inpatient
+0.088 glipizide_is_No
+0.078 diag_3_is_250
+0.074 pioglitazone_is_No
+0.047 number_emergency
+0.030 number_outpatient
+0.030 time_in_hospital
-0.002 num_lab_procedures
-0.030 A1Cresult_is_None
-0.058 num_procedures
-0.065 age_is_[60-70)
-0.087 number_diagnoses
-0.098 metformin_is_No
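
For a logistic regression, the contributions in a table like this are the learned weight times the feature value, the `<BIAS>` row is the intercept, and the probability is the sigmoid of their sum. The sketch below checks that relationship on synthetic data (the model and data here are illustrative, not the notebook's) and also confirms that the reported score of 0.824 corresponds to the reported probability of 0.695.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] - 0.5 * X[:, 2] + 0.3 * rng.normal(size=500) > 0).astype(int)
model = LogisticRegression(solver='liblinear', random_state=0).fit(X, y)

row = X[0]

# Per-feature contribution = learned weight * feature value.
contributions = model.coef_[0] * row
# The score adds the intercept (shown as <BIAS> in the ELI5 table).
score = model.intercept_[0] + contributions.sum()
# The sigmoid turns the score into the probability of class 1.
probability = 1.0 / (1.0 + np.exp(-score))

# Sanity check on the numbers reported in Out[30]: sigmoid(0.824) is about 0.695.
reported = 1.0 / (1.0 + np.exp(-0.824))

print(contributions, score, probability, reported)
```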
In [31]:
exp = PermutationImportance(ML_models['LR'],
                            random_state = 0).fit(X_test, y_test)
eli5.show_weights(exp,feature_names=list(X_test.columns),top=None)
Out[31]:
Weight Feature
0.0487 ± 0.0023 number_inpatient
0.0077 ± 0.0019 number_emergency
0.0041 ± 0.0012 diag_1_is_428
0.0040 ± 0.0010 race_is_Caucasian
0.0038 ± 0.0018 number_outpatient
0.0032 ± 0.0017 number_diagnoses
0.0031 ± 0.0020 diabetesMed_is_No
0.0026 ± 0.0005 diag_2_is_403
0.0016 ± 0.0022 num_procedures
0.0016 ± 0.0007 age_is_[70-80)
0.0015 ± 0.0005 diag_1_is_491
0.0012 ± 0.0012 diag_2_is_428
0.0010 ± 0.0005 age_is_[90-100)
0.0009 ± 0.0012 glipizide_is_No
0.0009 ± 0.0010 diag_3_is_403
0.0008 ± 0.0004 diag_3_is_250.6
0.0008 ± 0.0010 diag_3_is_428
0.0007 ± 0.0008 time_in_hospital
0.0006 ± 0.0007 insulin_is_Steady
0.0005 ± 0.0011 gender_is_Female
0.0005 ± 0.0003 diag_2_is_250
0.0005 ± 0.0003 diag_2_is_518
0.0004 ± 0.0004 diag_2_is_585
0.0004 ± 0.0004 change_is_Ch
0.0004 ± 0.0004 diag_1_is_250.6
0.0004 ± 0.0003 diag_1_is_493
0.0003 ± 0.0009 diag_2_is_411
0.0003 ± 0.0014 num_lab_procedures
0.0003 ± 0.0004 diag_2_is_401
0.0003 ± 0.0008 age_is_[60-70)
0.0003 ± 0.0008 diag_2_is_707
0.0001 ± 0.0004 diag_3_is_276
0.0001 ± 0.0006 diag_3_is_401
0.0001 ± 0.0006 pioglitazone_is_No
0.0001 ± 0.0008 insulin_is_Down
0.0000 ± 0.0003 race_is_Unknown
-0.0000 ± 0.0004 rosiglitazone_is_Steady
-0.0000 ± 0.0005 A1Cresult_is_None
-0.0001 ± 0.0009 diag_2_is_496
-0.0001 ± 0.0004 diag_3_is_250
-0.0003 ± 0.0011 race_is_AfricanAmerican
-0.0003 ± 0.0006 A1Cresult_is_Norm
-0.0010 ± 0.0010 metformin_is_No
-0.0011 ± 0.0005 age_is_[80-90)
In [32]:
eli5.show_weights(ML_models['RF'],feature_names=list(X_test.columns),top=None)
Out[32]:
Weight Feature
0.2117 ± 0.0140 num_lab_procedures
0.1192 ± 0.0206 time_in_hospital
0.0742 ± 0.0205 num_procedures
0.0681 ± 0.0152 number_diagnoses
0.0588 ± 0.0077 number_inpatient
0.0337 ± 0.0070 gender_is_Female
0.0263 ± 0.0076 number_outpatient
0.0221 ± 0.0043 age_is_[60-70)
0.0217 ± 0.0068 insulin_is_Steady
0.0200 ± 0.0050 number_emergency
0.0194 ± 0.0090 age_is_[70-80)
0.0193 ± 0.0047 A1Cresult_is_None
0.0187 ± 0.0105 change_is_Ch
0.0176 ± 0.0086 metformin_is_No
0.0169 ± 0.0069 age_is_[80-90)
0.0169 ± 0.0045 glipizide_is_No
0.0166 ± 0.0061 race_is_Caucasian
0.0143 ± 0.0045 race_is_AfricanAmerican
0.0132 ± 0.0023 pioglitazone_is_No
0.0127 ± 0.0095 diag_3_is_250
0.0122 ± 0.0066 diag_3_is_401
0.0118 ± 0.0025 diag_3_is_276
0.0115 ± 0.0064 diag_2_is_428
0.0112 ± 0.0064 insulin_is_Down
0.0110 ± 0.0035 rosiglitazone_is_Steady
0.0103 ± 0.0031 diag_3_is_428
0.0097 ± 0.0060 diabetesMed_is_No
0.0090 ± 0.0020 diag_2_is_496
0.0088 ± 0.0049 diag_1_is_428
0.0083 ± 0.0046 diag_2_is_250
0.0080 ± 0.0023 A1Cresult_is_Norm
0.0068 ± 0.0036 diag_2_is_401
0.0066 ± 0.0014 diag_2_is_411
0.0066 ± 0.0018 age_is_[90-100)
0.0060 ± 0.0020 diag_2_is_707
0.0059 ± 0.0023 diag_3_is_403
0.0058 ± 0.0032 diag_2_is_403
0.0053 ± 0.0017 diag_2_is_585
0.0052 ± 0.0027 diag_1_is_491
0.0043 ± 0.0012 diag_2_is_518
0.0040 ± 0.0019 race_is_Unknown
0.0037 ± 0.0013 diag_3_is_250.6
0.0034 ± 0.0012 diag_1_is_493
0.0033 ± 0.0014 diag_1_is_250.6
In [33]:
eli5.show_prediction(ML_models['RF'], test_row.values[0],feature_names=list(X_test.columns),top=None)
Out[33]:

y=1 (probability 0.770) top features

Contribution? Feature
+0.537 <BIAS>
+0.065 number_inpatient
+0.063 time_in_hospital
+0.059 num_lab_procedures
+0.044 diag_3_is_250
+0.041 diabetesMed_is_No
+0.021 number_outpatient
+0.019 race_is_Caucasian
+0.014 number_emergency
+0.013 age_is_[80-90)
+0.010 diag_1_is_428
+0.009 diag_3_is_401
+0.008 gender_is_Female
+0.005 age_is_[70-80)
+0.005 change_is_Ch
+0.004 glipizide_is_No
+0.004 diag_2_is_428
+0.002 diag_1_is_491
+0.002 insulin_is_Down
+0.002 diag_2_is_585
+0.002 diag_3_is_403
+0.002 diag_2_is_403
+0.002 diag_2_is_411
+0.002 age_is_[90-100)
+0.002 insulin_is_Steady
+0.001 diag_1_is_250.6
+0.001 diag_3_is_250.6
+0.001 rosiglitazone_is_Steady
+0.001 diag_3_is_428
+0.000 diag_1_is_493
+0.000 diag_2_is_401
+0.000 diag_2_is_518
+0.000 diag_2_is_707
-0.000 pioglitazone_is_No
-0.001 diag_3_is_276
-0.001 diag_2_is_250
-0.003 A1Cresult_is_Norm
-0.003 diag_2_is_496
-0.004 race_is_Unknown
-0.005 num_procedures
-0.009 metformin_is_No
-0.014 race_is_AfricanAmerican
-0.016 A1Cresult_is_None
-0.044 age_is_[60-70)
-0.069 number_diagnoses
In [34]:
exp = PermutationImportance(ML_models['RF'],
                            random_state = 0).fit(X_test, y_test)
eli5.show_weights(exp,feature_names=list(X_test.columns),top=None)
Out[34]:
Weight Feature
0.0462 ± 0.0065 number_inpatient
0.0078 ± 0.0024 number_emergency
0.0054 ± 0.0014 number_outpatient
0.0035 ± 0.0010 diag_1_is_428
0.0029 ± 0.0022 number_diagnoses
0.0022 ± 0.0016 num_lab_procedures
0.0012 ± 0.0002 diag_3_is_428
0.0009 ± 0.0005 diag_2_is_518
0.0009 ± 0.0005 diag_2_is_403
0.0009 ± 0.0011 diag_2_is_428
0.0009 ± 0.0007 diag_1_is_491
0.0008 ± 0.0008 diag_3_is_403
0.0007 ± 0.0022 age_is_[70-80)
0.0006 ± 0.0006 diag_3_is_250.6
0.0006 ± 0.0032 diabetesMed_is_No
0.0005 ± 0.0009 diag_2_is_401
0.0005 ± 0.0007 diag_3_is_276
0.0004 ± 0.0006 race_is_Unknown
0.0004 ± 0.0017 rosiglitazone_is_Steady
0.0004 ± 0.0002 diag_1_is_493
0.0003 ± 0.0008 diag_2_is_496
0.0002 ± 0.0005 diag_2_is_585
0.0001 ± 0.0009 pioglitazone_is_No
0.0001 ± 0.0005 diag_1_is_250.6
0.0001 ± 0.0018 A1Cresult_is_None
0.0000 ± 0.0014 diag_3_is_250
-0.0002 ± 0.0011 glipizide_is_No
-0.0002 ± 0.0018 insulin_is_Steady
-0.0002 ± 0.0033 time_in_hospital
-0.0002 ± 0.0021 num_procedures
-0.0003 ± 0.0006 diag_2_is_250
-0.0004 ± 0.0010 diag_2_is_411
-0.0005 ± 0.0004 diag_2_is_707
-0.0008 ± 0.0013 diag_3_is_401
-0.0009 ± 0.0018 gender_is_Female
-0.0010 ± 0.0008 age_is_[90-100)
-0.0011 ± 0.0008 A1Cresult_is_Norm
-0.0012 ± 0.0020 metformin_is_No
-0.0012 ± 0.0018 change_is_Ch
-0.0013 ± 0.0006 insulin_is_Down
-0.0022 ± 0.0018 race_is_AfricanAmerican
-0.0024 ± 0.0009 race_is_Caucasian
-0.0026 ± 0.0021 age_is_[60-70)
-0.0027 ± 0.0014 age_is_[80-90)
In [35]:
eli5.show_prediction(ML_models['DT'], test_row.values[0],feature_names=list(X_test.columns),top=None)
Out[35]:

y=1 (probability 1.000) top features

Contribution? Feature
+0.537 <BIAS>
+0.193 num_lab_procedures
+0.090 time_in_hospital
+0.076 number_inpatient
+0.074 num_procedures
+0.065 diag_2_is_403
+0.062 diabetesMed_is_No
+0.027 diag_2_is_428
+0.021 diag_1_is_250.6
+0.014 number_outpatient
+0.011 number_emergency
+0.008 diag_1_is_428
+0.007 diag_2_is_585
+0.004 diag_3_is_403
+0.003 diag_1_is_491
-0.003 race_is_Unknown
-0.008 A1Cresult_is_Norm
-0.048 number_diagnoses
-0.049 age_is_[60-70)
-0.082 race_is_Caucasian
In [36]:
eli5.show_prediction(ML_models['NN'], test_row.values[0],feature_names=list(X_test.columns),top=None)
Out[36]:
Error: estimator MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9, beta_2=0.999, early_stopping=True, epsilon=1e-08, hidden_layer_sizes=[100, 100, 100, 100, 100], learning_rate='adaptive', learning_rate_init=0.001, max_iter=200, momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5, random_state=0, shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1, verbose=False, warm_start=False) is not supported
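
Since `show_prediction` does not support `MLPClassifier`, one model-agnostic option is permutation importance, which only needs predictions. The sketch below uses scikit-learn's `permutation_importance` (swapped in here instead of ELI5's `PermutationImportance`) on a small network trained on synthetic data. The network size and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPClassifier

# Synthetic data for illustration: only feature 0 determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] > 0).astype(int)
X_fit, X_eval = X[:400], X[400:]
y_fit, y_eval = y[:400], y[400:]

# A small network keeps the sketch fast; the sizes are arbitrary.
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
net.fit(X_fit, y_fit)

# Shuffle one column at a time and measure the drop in accuracy.
result = permutation_importance(net, X_eval, y_eval, n_repeats=5, random_state=0)
most_important = int(np.argmax(result.importances_mean))
print(result.importances_mean, most_important)
```

Note that this gives a global ranking of features rather than a per-row explanation, so it complements rather than replaces the LIME and SHAP views of the network.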

SHAP

In [37]:
explainer = LinearExplainer(ML_models['LR'], X_train, feature_perturbation="interventional")
shap_values = explainer.shap_values(test_row.values)
shap.force_plot(explainer.expected_value,
                shap_values,
                test_row.values,
                feature_names=X_test.columns)
In [38]:
shap_values = explainer.shap_values(X_test.head(250).values)
shap.force_plot(explainer.expected_value,
                shap_values,
                X_test.head(250).values,
                feature_names=X_test.columns)
In [39]:
shap_values = explainer.shap_values(X_test.values)
spplot = shap.summary_plot(shap_values, X_test.values, feature_names=X_test.columns)
In [40]:
top4_cols = ['number_inpatient','number_diagnoses','diabetesMed_is_No','number_emergency']
for col in top4_cols:
    shap.dependence_plot(col, shap_values, X_test)
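
For a linear model with features treated as independent, the SHAP value of a feature reduces to its weight times the feature's deviation from the background mean, expressed in log-odds units. The sketch below computes this by hand with NumPy on synthetic data instead of calling `shap`, and checks the additivity property. The data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = LogisticRegression(solver='liblinear', random_state=0).fit(X, y)

background_mean = X.mean(axis=0)
row = X[0]

# Attribution of feature i (in log-odds units): weight_i * (x_i - mean_i).
linear_shap = model.coef_[0] * (row - background_mean)
# Baseline: the model's log-odds output at the background mean.
expected_value = model.intercept_[0] + model.coef_[0] @ background_mean

# Additivity: baseline plus attributions recovers the row's log-odds score.
reconstructed = expected_value + linear_shap.sum()
actual = model.decision_function(row.reshape(1, -1))[0]
print(linear_shap, reconstructed, actual)
```

The difference from the ELI5 contributions is the baseline: here each feature is measured against its average value rather than against zero.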
In [41]:
explainer = TreeExplainer(ML_models['RF'])
shap_values = explainer.shap_values(test_row.values)
shap.force_plot(explainer.expected_value[1],
                shap_values[1],
                test_row.values,
                feature_names=X_test.columns)
Setting feature_perturbation = "tree_path_dependent" because no background data was given.
In [42]:
X_train_kmeans = shap.kmeans(X_train, 10)
explainer = KernelExplainer(ML_models['NN'].predict_proba,X_train_kmeans)
shap_values = explainer.shap_values(test_row.values)
shap.force_plot(explainer.expected_value[1],
                shap_values[1],
                test_row.values,
                feature_names=X_test.columns)
l1_reg="auto" is deprecated and in the next version (v0.29) the behavior will change from a conditional use of AIC to simply "num_features(10)"!